<!--begin.rcode results='hide', echo=FALSE, message=FALSE

library(caret)
data(BloodBrain)
library(parallel)
library(doMC)
registerDoMC(cores=detectCores()-1)
library(randomForest)

theme1 <- caretTheme()
theme1$superpose.symbol$col = c(rgb(1, 0, 0, .4), rgb(0, 0, 1, .4), 
                                rgb(0.3984375, 0.7578125,0.6445312, .6))
theme1$superpose.symbol$pch = c(15, 16, 17)
theme1$superpose.cex = .8
theme1$superpose.line$col = c(rgb(1, 0, 0, .9), rgb(0, 0, 1, .9), rgb(0.3984375, 0.7578125,0.6445312, .6))
theme1$superpose.line$lwd <- 2
theme1$superpose.line$lty = 1:3
theme1$plot.symbol$col = c(rgb(.2, .2, .2, .4))
theme1$plot.symbol$pch = 16
theme1$plot.cex = .8
theme1$plot.line$col = c(rgb(1, 0, 0, .7))
theme1$plot.line$lwd <- 2
theme1$plot.line$lty = 1

hook_inline = knit_hooks$get('inline')
knit_hooks$set(inline = function(x) {
  if (is.character(x)) highr::hi_html(x) else hook_inline(x)
  })
opts_chunk$set(comment=NA, digits = 3)

session <- paste(format(Sys.time(), "%a %b %d %Y"),
                 "using caret version",
                 packageDescription("caret")$Version,
                 "and",
                 R.Version()$version.string)
end.rcode--> 

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!--
Design by Free CSS Templates
http://www.freecsstemplates.org
Released for free under a Creative Commons Attribution 2.5 License

Name       : Emerald 
Description: A two-column, fixed-width design with dark color scheme.
Version    : 1.0
Released   : 20120902

-->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="keywords" content="" />
<meta name="description" content="" />
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>Feature Selection using Genetic Algorithms</title>
<link href='http://fonts.googleapis.com/css?family=Abel' rel='stylesheet' type='text/css'>
<link href="style.css" rel="stylesheet" type="text/css" media="screen" />
</head>
<body>
<div id="wrapper">
<div id="header-wrapper" class="container">
<div id="header" class="container">
<div id="logo">
<h1><a href="#">Feature Selection using Genetic Algorithms</a></h1>
</div>
<!--
<div id="menu">
<ul>
<li class="current_page_item"><a href="#">Homepage</a></li>
<li><a href="#">Blog</a></li>
<li><a href="#">Photos</a></li>
<li><a href="#">About</a></li>
<li><a href="#">Contact</a></li>
</ul>
</div>
-->
</div>
<div><img src="images/img03.png" width="1000" height="40" alt="" /></div>
</div>
<!-- end #header -->
<div id="page">
<div id="content">

<h1>Contents</h1>  
<ul>
<li><a href="#ga">Genetic Algorithms</a></li> 
<li><a href="#performance">Internal and External Performance Estimates</a></li> 
<li><a href="#syntax">Basic Syntax</a></li> 
<li><a href="#example">Example</a></li> 
<li><a href="#custom">Customizing the Search</a></li> 
<li><a href="#example2">The Example Revisited</a></li> 
</ul>   

<div id="ga"></div>
<h1>Genetic Algorithms</h1>

<p>
Genetic algorithms (GAs) mimic Darwinian forces of natural selection to find optimal values of some function (<a href="http://mitpress.mit.edu/books/introduction-genetic-algorithms">Mitchell, 1998</a>). An initial set of candidate solutions are created and their corresponding <i>fitness</i> values are calculated (where larger values are better). This set of solutions is referred to as a population and each solution as an <i>individual</i>. The individuals with the best fitness values are combined randomly to produce offsprings which make up the next population. To do so, individual are selected and undergo cross-over (mimicking genetic reproduction) and also are subject to random mutations. This process is repeated again and again and many generations are produced (i.e. iterations of the search procedure) that should create better and better solutions. 
</p>
<p>
For feature selection, the individuals are subsets of predictors that are encoded as binary; a feature is either included or not in the subset. The fitness values are some measure of model performance, such as the RMSE or classification accuracy. One issue with using GAs for feature selection is that the optimization process can be very aggressive and their is potential for the GA to overfit to the predictors (much like the previous discussion for RFE). 
</p>


<div id="performance"></div>
<h1>Internal and External Performance Estimates</h1>

<p>
The genetic algorithm code in <a href="http://cran.r-project.org/package=caret"><strong>caret</strong></a> conducts the search of the feature space repeatedly within resampling iterations. First, the training data are split be whatever resampling method was specified in the control function. For example, if 10-fold cross-validation is selected, the entire genetic algorithm is conducted 10 separate times. For the first fold, nine tenths of the data are used in the search while the remaining tenth is used to estimate the external performance since these data points were not used in the search. 
</p>
<p>
During the genetic algorithm, a measure of fitness is needed to guide the search. This is the internal measure of performance. During the search, the data that are available are the instances selected by the top-level resampling (e.g. the nine tenths mentioned above). A common approach is to conduct another resampling procedure. Another option is to use a holdout set of samples to determine the internal estimate of performance (see the holdout argument of the control function). While this is faster, it is more likely to cause overfitting of the features and should only be used when a large amount of training data are available. Yet another idea is to use a penalized metric (such as the AIC statistic) but this may not exist for some metrics (e.g. the area under the ROC curve). 
</p>
<p>
The internal estimates of performance will eventually overfit the subsets to the data. However, since the external estimate is not used by the search, it is able to make better assessments of overfitting. After resampling, this function determines the optimal number of generations for the GA. 
</p>
<p>
Finally, the entire data set is used in the last execution of the genetic algorithm search and the final model is built on the predictor subset that is associated with the optimal number of generations determined by resampling (although the update function can be used to manually set the number of generations). 
</p>

<div id="syntax"></div>
<h1>Basic Syntax</h1>

<p>
The most basic usage of the function is:
</p>

<!--begin.rcode ga_syntax_basic,eval=FALSE
obj <- gafs(x = predictors, 
            y = outcome,
            iters = 100)
end.rcode--> 

<p>
where
</p>
<ul>
  <li> <span class="mx arg">x</span>: 
  a data frame or matrix of predictor values
  </li> 
  <li> <span class="mx arg">y</span>: 
  a factor or numeric vector of outcomes
  </li> 
  <li> <span class="mx arg">iters</span>: 
  the number of generations for the GA
  </li> 
</ul>
<p>
This isn't very specific. All of the action is in the control function. That can be used to specify the model to be fit, how predictions are made and summarized as well as the genetic operations. 
</p>
<p>
Suppose that we want to fit a linear regression model. To do this, we can use <span class="mx funCall">train</span> as an interface and pass arguments to that function through <span class="mx funCall">gafs</span>:
</p>
<!--begin.rcode ga_syntax_basic_lm,eval=FALSE
ctrl <- gafsControl(functions = caretGA)
obj <- gafs(x = predictors, 
            y = outcome,
            iters = 100,
            gafsControl = ctrl,
            ## Now pass options to `train`
            
            method = "lm")
end.rcode--> 

<p>
Other options, such as <span class="mx arg">preProcess</span>, can be passed in as well. 
</p>

<p>
Some important options to <span class="mx funCall">gafsControl</span> are:
</p>
<ul>
  <li> <span class="mx arg">method</span>, <span class="mx arg">number</span>, <span class="mx arg">repeats</span>, <span class="mx arg">index</span>, <span class="mx arg">indexOut</span>, etc: 
  options similar to those for <a href="http://topepo.github.io/caret/model-training-and-tuning.html#control"><span class="mx funCall">train</span></a> top control resampling.
  </li> 
  <li> <span class="mx arg">metric</span>: 
  this is similar to <a href="http://topepo.github.io/caret/model-training-and-tuning.html#control"><span class="mx funCall">train</span></a>'s option but, in this case, the value should be a named vector with values for the internal and external metrics. If none are specified, the first value returned by the summary functions (see details below) are used and a warning is issued. A similar two-element vector for the option <span class="mx arg">maximize</span> is also required. See the <a href="#example2">last example here</a> for an illustration.
  </li> 
  <li> <span class="mx arg">holdout</span>: 
  this is a number between [0, 1) that can be used to hold out samples for computing the internal fitness value. Note that this is independent of the external resampling step. Suppose 10-fold CV is being used. Within a resampling iteration, <span class="mx arg">holdout</span> can be used to sample an additional proportion of the 90% resampled data to use for estimating fitness. This may not be a good idea unless you have a very large training set and want to avoid an internal resampling procedure to estimate fitness. 
  </li> 
  <li> <span class="mx arg">allowParallel</span> and <span class="mx arg">genParallel</span>: 
  these are logicals to control where parallel processing should be used (if at all). The former will parallelize the external resampling while the latter parallelizes the fitness calculations within a generation. <span class="mx arg">allowParallel</span> will almost always be more advantageous. 
  </li>  
</ul>
<p>
There are a few built-in sets of functions to use with <span class="mx funCall">gafs</span>: <code>caretGA</code>, <code>rfGA</code>, and <code>treebagGA</code>. The first is a simple interface to 
<span class="mx funCall">train</span>. When using this, as shown above, arguments can be passed to <span class="mx funCall">train</span> using the <span class="mx arg">...</span> structure and the resampling estimates of performance can be used as the internal fitness value. The functions provided by  <code>rfGA</code> and <code>treebagGA</code> avoid using <span class="mx funCall">train</span> and their internal estimates of fitness come from using the out-of-bag estimates generated from the model. 
</p>
<p>
The GA implementation in <a href="http://cran.r-project.org/web/packages/caret/index.html"><strong>caret</strong></a> uses the underlying code from the <a href="http://cran.r-project.org/package=GA"><strong>GA</strong></a>  package (<a href="http://www.jstatsoft.org/v53/i04/">Scrucca, 2013</a>).
</p>


<div id="example"></div>
<h1>Example</h1>

<p>
Using the example from the <a href="rfe.html#example">previous page</a> where there are five real predictors and 40 noise predictors:
</pa>

<!--begin.rcode ga_load_sim,echo=TRUE,message=FALSE
library(mlbench)
n <- 100
p <- 40
sigma <- 1
set.seed(1)
sim <- mlbench.friedman1(n, sd = sigma)
colnames(sim$x) <- c(paste("real", 1:5, sep = ""),
                     paste("bogus", 1:5, sep = ""))
bogus <- matrix(rnorm(n * p), nrow = n)
colnames(bogus) <- paste("bogus", 5+(1:ncol(bogus)), sep = "")
x <- cbind(sim$x, bogus)
y <- sim$y
normalization <- preProcess(x)
x <- predict(normalization, x)
x <- as.data.frame(x)
end.rcode--> 

<p>
We'll fit a random forest model and use the out-of-bag RMSE estimate as the internal performance metric and use the same repeated 10-fold cross-validation process used with the search. To do this, we'll use the built-in <code>rfGA</code> object for this purpose. The default GA operators will be used and conduct 200 generations of the algorithm.
</p>
 
<!--begin.rcode ga_rf,cache=TRUE
ga_ctrl <- gafsControl(functions = rfGA,
                       method = "repeatedcv",
                       repeats = 5)

## Use the same random number seed as the RFE process
## so that the same CV folds are used for the external
## resampling. 
set.seed(10)
rf_ga <- gafs(x = x, y = y,
              iters = 200,
              gafsControl = ga_ctrl)
rf_ga
end.rcode--> 

<p>
With 5 repeats of 10-fold cross-validation, the GA was executed 50 times. The average external performance is calculated across resamples and these results are used to determine the optimal number of iterations for the final GA to avoid over-fitting. Across the resamples, an average of <!--rinline I(round(mean(unlist(lapply(rf_ga$resampled_vars, length))), 1)) --> predictors were selected at the end of each of the algorithms. 
</p>
<p>
The <span class="mx funCall">plot</span> function is used to monitor the average of the internal out-of-bag RMSE estimates as well as the average of the external performance estimates calculated from the 50 out-of-sample predictions. By default, this function uses <a href="http://cran.r-project.org/package=ggplot2"><strong>ggplot2</strong></a> package. A black and white theme can be "added" to the output object:
</p>

<!--begin.rcode ga_rf_profile,fig.width=8,fig.height=4
plot(rf_ga) + theme_bw()
end.rcode--> 

<p>
Based on these results, the generation associated with the best external RMSE estimate was <!--rinline I(rf_ga$optIter) --> with a corresponding RMSE estimate of <!--rinline I(round(mean(subset(rf_ga$external, Iter == rf_ga$optIter)$RMSE), 2)) -->. 
</p>
<p>
Using the entire training set, the final GA is conducted and, at generation <!--rinline I(rf_ga$optIter) -->, there were <!--rinline I(length(rf_ga$optVariables)) --> that were selected: <!--rinline I(paste(rf_ga$optVariables, sep = "", collapse = ", ")) -->. The random forest model with these predictors is created using the entire training set is trained and this is the model that is used when <span class="mx funCall">predict.gafs</span> is executed. 
</p>

<p>
<b>Note:</b> the correlation between the internal and external fitness values is somewhat atypical for most real-world problems. This is a function of the nature of the simulations (a small number of uncorrelated informative predictors) and that the OOB error estimate from random forest is a product of hundreds of trees. Your mileage may vary. 
</b>


<div id="custom"></div>
<h1>Customizing the Search</h1>

<!-- ---------------------------------------------------------------------------------------- -->

<b><p><i>The <span class="mx funCall">fit</span> Function</p></i></b>

<p>
This function builds the model based on a proposed current subset. The arguments for the function must be:
</p>

<ul>
<li> <span class="mx arg">x</span>: the current training set of predictor data with the appropriate subset of variables</li> 
<li> <span class="mx arg">y</span>: the current outcome data (either a numeric or factor vector)</li> 
<li> <span class="mx arg">lev</span>: a character vector with the class levels (or <code>NULL</code> for regression problems)</li> 
  <li> <span class="mx arg">last</span>: a logical that is <span class="mx arg">TRUE</span> when the final GA search is conducted on the entire data set </li> 
<li> <span class="mx arg">...</span>: optional arguments to pass to the fit function in the call to <span class="mx funCall">gafs</span></li> 
</ul>

<p>
The function should return a model object that can be used to generate predictions. For random forest, the fit function is simple:
</p>

<!--begin.rcode ga_fit
rfGA$fit
    end.rcode--> 

<!-- ---------------------------------------------------------------------------------------- -->

<b><p><i>The <span class="mx funCall">pred</span> Function</p></i></b>

<p>
This function returns a vector of predictions (numeric or factors) from the current model . The input arguments must be
</p>

<ul>
<li> <span class="mx arg">object</span>: the model generated by the <span class="mx funCall">fit</span> function</li> 
<li> <span class="mx arg">x</span>: the current set of predictor set for the held-back samples</li> 
</ul>

<p>
For random forests, the function is a simple wrapper for the predict function:
</p>

<!--begin.rcode ga_pred
rfGA$pred
    end.rcode--> 

<p>
For classification, it is probably a good idea to ensure that the resulting factor variables of predictions has the same levels as the input data.
</p>

<!-- ---------------------------------------------------------------------------------------- -->

<b><p><i>The <span class="mx funCall">fitness_intern</span> Function</p></i></b>

<p>
The <span class="mx funCall">fitness_intern</span> function takes the fitted model and computes one or more performance metrics. The inputs to this function are:
</p>

<ul>
<li> <span class="mx arg">object</span>: the model generated by the <span class="mx funCall">fit</span> function</li> 
<li> <span class="mx arg">x</span>: the current set of predictor set. If the option <span class="mx funCall">gafsControl</span><code>$</code><span class="mx arg">holdout</span> is zero, these values will be from the current resample (i.e. the same data used to fit the model). Otherwise, the predictor values are from the hold-out set created by <span class="mx funCall">gafsControl</span><code>$</code><span class="mx arg">holdout</span>.
</li> 
<li> <span class="mx arg">y</span>:  outcome values. See the note for the <span class="mx arg">x</span> argument to understand which data are presented to the function. </li>
<li> <span class="mx arg">maximize</span>: a logical from <span class="mx funCall">gafsControl</span> that indicates whether the metric should be maximized or minimized</li>
<li> <span class="mx arg">p</span>: the total number of possible predictors</li>
</ul>

<p>
The output should be a <b>named</b> numeric vector of performance values.
</p>
<p>
In many cases, some resampled measure of performance is used. In the example above using random forest, the OOB error was used. In other cases, the resampled performance from <span class="mx funCall">train</span> can be used and, if <span class="mx funCall">gafsControl</span><code>$</code><span class="mx arg">holdout</span> is not zero, a static hold-out set can be used. This depends on the data and problem at hand. If left 
</p>
<p>
The example function for random forest is:
</p>

<!--begin.rcode ga_int_func
rfGA$fitness_intern
    end.rcode--> 

<!-- ---------------------------------------------------------------------------------------- -->

<b><p><i>The <span class="mx funCall">fitness_extern</span> Function</p></i></b>

<p>
The <span class="mx funCall">fitness_extern</span> function takes the observed and predicted values form the external resampling process and computes one or more performance metrics. The input arguments are:
</p>
<ul>
<li> <span class="mx arg">data</span>: a data frame or predictions generated by the <span class="mx funCall">fit</span> function. For regression, the predicted values in a column called <code>pred</code>. For classification, <code>pred</code> is a factor vector. Class probabilities are usually attached as columns whose names are the class levels (see the random forest example for the <span class="mx funCall">fit</span> function above)
</li> 
<li> <span class="mx arg">lev</span>: a character vector with the class levels (or <code>NULL</code> for regression problems)</li> 
</ul>

<p>
The output should be a <b>named</b> numeric vector of performance values.
</p>

<p>
The example function for random forest is:
</p>

<!--begin.rcode ga_ext_func
rfGA$fitness_extern
    end.rcode--> 


<p>
Two functions in <a href="http://cran.r-project.org/web/packages/caret/index.html"><strong>caret</strong></a> that can be used as the summary
function are <span class="mx funCall">defaultSummary</span> and <span class="mx funCall">twoClassSummary</span> (for
classification problems with two classes). 
</p>

<!-- ---------------------------------------------------------------------------------------- -->

<b><p><i>The <span class="mx funCall">initial</span> Function</p></i></b>

<p>
This function creates an initial generation. Inputs are:
</p>

<ul>
<li> <span class="mx arg">vars</span>: the number of possible predictors</li> 
<li> <span class="mx arg">popSize</span>: the population size for each generation</li>
<li> <span class="mx arg">...</span>: not currently used</li> 
</ul>

<p>
The output should be a binary 0/1 matrix where there are <code>vars</code> columns corresponding to the predictors and <code>popSize</code> rows for the individuals in the population. 
</p>
<p>
The default function populates the rows randomly with subset sizes varying between 10% and 90% of number of possible predictors. For example:
</p>.

<!--begin.rcode ga_init
set.seed(128)
starting <- rfGA$initial(vars = 12, popSize = 8)
starting
apply(starting, 1, mean)
    end.rcode--> 

<p>
<span class="mx funCall">gafs</span> has an argument called <span class="mx arg">suggestions</span> that is similar to the one in the <span class="mx funCall">ga</span> function where the initial population can be seeded with specific subsets.  
</p>.

<!-- ---------------------------------------------------------------------------------------- -->

<b><p><i>The <span class="mx funCall">selection</span> Function</p></i></b>


<p>
This function conducts the genetic selection. Inputs are:
</p>

<ul>
<li> <span class="mx arg">population</span>: the indicators for the current population</li> 
<li> <span class="mx arg">fitness</span>: the corresponding fitness values for the population. Note that if the internal performance value is to be minimized, these are the negatives of the actual values</li>
<li> <span class="mx arg">r</span>, <span class="mx arg">q</span>: tuning parameters for specific selection functions. See <span class="mx funCall">gafs_lrSelection</span> and <span class="mx funCall">gafs_nlrSelection</span></li>
<li> <span class="mx arg">...</span>: not currently used</li> 
</ul>

<p>
The output should be a list with named elements. 
</p>

<ul>
<li> <span class="mx arg">population</span>: the indicators for the selected individuals </li> 
<li> <span class="mx arg">fitness</span>: the fitness values for the selected individuals</li>
</ul>

<p>
The default function is a version of the <a href="http://cran.r-project.org/package=GA"><strong>GA</strong></a> package's <span class="mx funCall">ga_lrSelection<span> function. 
</p>.

<!-- ---------------------------------------------------------------------------------------- -->

<b><p><i>The <span class="mx funCall">crossover</span> Function</p></i></b>

<p>
This function conducts the genetic crossover. Inputs are:
</p>

<ul>
<li> <span class="mx arg">population</span>: the indicators for the current population</li> 
<li> <span class="mx arg">fitness</span>: the corresponding fitness values for the population. Note that if the internal performance value is to be minimized, these are the negatives of the actual values</li>
<li> <span class="mx arg">parents</span>: a matrix with two rows containing indicators for the parent individuals. </li>
<li> <span class="mx arg">...</span>: not currently used</li> 
</ul>

<p>
The default function is a version of the <a href="http://cran.r-project.org/package=GA"><strong>GA</strong></a> package's <span class="mx funCall">ga_spCrossover</span> function. Another function that is a version of that package's uniform cross-over function is also available. 
</p>.

<p>
The output should be a list with named elements. 
</p>

<ul>
<li> <span class="mx arg">children</span>: from <code>?</code><span class="mx funCall">ga_spCrossover</span>: "a matrix of dimension 2 times the number of decision variables containing the generated offsprings" </li> 
<li> <span class="mx arg">fitness</span>: "a vector of length 2 containing the fitness values for the offsprings. A value <code>NA</code> is returned if an offspring is different (which is usually the case) from the two parents."</li>
</ul>

<!-- ---------------------------------------------------------------------------------------- -->

<b><p><i>The <span class="mx funCall">mutation</span> Function</p></i></b>


<p>
This function conducts the genetic mutation. Inputs are:
</p>

<ul>
<li> <span class="mx arg">population</span>: the indicators for the current population</li> 
<li> <span class="mx arg">parents</span>: a vector of indices for where the mutation should occur. </li>
<li> <span class="mx arg">...</span>: not currently used</li> 
</ul>

<p>
The default function is a version of the <a href="http://cran.r-project.org/package=GA"><strong>GA</strong></a> package's <span class="mx funCall">gabin_raMutation</span> function. 
</p>.

<p>
The output should the mutated population.
</p>

<!-- ---------------------------------------------------------------------------------------- -->

<b><p><i>The <span class="mx funCall">selectIter</span> Function</p></i></b>

<p>
This function determines the optimal number of generations based on the resampling output. Inputs for the function are:
</p>

<ul>
<li> <span class="mx arg">x</span>: a matrix with columns for the performance metrics averaged over resamples</li> 
<li> <span class="mx arg">metric</span>: a character string of the performance measure to optimize (e.g. RMSE, Accuracy)</li> 
<li> <span class="mx arg">maximize</span>: a single logical for whether the metric should be maximized</li> 
</ul>

<p>
This function should return an integer corresponding to the optimal subset size.
</p>

<div id="example2"></div>
<h1>The Example Revisited</h1>

<p>
The previous GA included some of the non-informative predictors. We can cheat a little and try to bias the search to get the right solution. 
</p>

<p>
We can try to encourage the algorithm to choose fewer predictors, we can penalize the the RMSE estimate. Normally, a metric like the Akaike information criterion (AIC) statistic would be used. However, with a random forest model, there is no real notion of model degrees of freedom. As an alternative, we can use <a href="http://scholar.google.com/scholar?q=%22desirability+functions">desirability functions</a> to penalize the RMSE. To do this, two functions are created that translate the number of predictors and the RMSE values to a measure of "desirability". For the number of predictors, the most desirable property would be a single predictor and the worst situation would be if the model required all 50 predictors. That desirability function is visualized as:
</p>

<!--begin.rcode ga_pred_d,fig.width=6,fig.height=4,echo=FALSE
pred_d <- data.frame(Predictors = 0: 52)
pred_d$Desirability <- predict(dMin(1, 50), pred_d$Predictors)
ggplot(pred_d, aes(x = Predictors, y = Desirability)) + geom_line() + theme_bw()
end.rcode--> 

<p>
For the RMSE, the best case would be zero. Many poor models have values around four. To give the RMSE value more weight in the overall desirability calculation, we use a scale parameter value of 2. This desirability function is:
</p>

<!--begin.rcode ga_rmse_d,fig.width=6,fig.height=4,echo=FALSE
rmse_d <- data.frame(RMSE = seq(-.25, 4.24, length = 100))
rmse_d$Desirability <- predict(dMin(0, 4, 2), rmse_d$RMSE)
ggplot(rmse_d, aes(x = RMSE, y = Desirability)) + geom_line() + theme_bw()
end.rcode--> 

<p>
To use the overall desirability to drive the feature selection, the <span class="mx funCall">internal</span> function requires replacement. We make a copy of <code>rfGA</code> and add code using the <a href="http://cran.r-project.org/package=desirability"><strong>desirability</strong></a> package and the function returns the estimated RMSE and the overall desirability. The <span class="mx funCall">gafsControl</span> function also need changes. The <span class="mx arg">metric</span> argument needs to reflect that the overall desirability score should be maximized internally but the RMSE estimate should be minimized externally. 
</p>

<!--begin.rcode ga_rf_2,cache=TRUE
library(desirability)
rfGA2 <- rfGA
rfGA2$fitness_intern <- function (object, x, y, maximize, p) {
  RMSE <- rfStats(object)[1]
  d_RMSE <- dMin(0, 4)
  d_Size <- dMin(1, p, 2)
  overall <- dOverall(d_RMSE, d_Size)
  D <- predict(overall, data.frame(RMSE, ncol(x)))
  c(D = D, RMSE = as.vector(RMSE))
  }
ga_ctrl_d <- gafsControl(functions = rfGA2,
                         method = "repeatedcv",
                         repeats = 5,
                         metric = c(internal = "D", external = "RMSE"),
                         maximize = c(internal = TRUE, external = FALSE))

set.seed(10)
rf_ga_d <- gafs(x = x, y = y,
                iters = 200,
                gafsControl = ga_ctrl_d)

rf_ga_d
end.rcode--> 

<p>
Here are the RMSE values for this search:
</p>


<!--begin.rcode ga_rf_profile_2,fig.width=8,fig.height=4
plot(rf_ga_d) + theme_bw()
end.rcode--> 

<p>
The final GA is found <!--rinline I(length(rf_ga_d$optVariables)) --> that were selected: <!--rinline I(paste(rf_ga_d$optVariables,  sep = "", collapse = ", ")) -->. During resampling, the average number of predictors selected was <!--rinline I(round(mean(unlist(lapply(rf_ga_d$resampled_vars, length))), 1)) -->, indicating that the penalty on the number of predictors was effective. 
</p>


<!-- ------------------------ end #content------------------------ -->

<div style="clear: both;">&nbsp;</div>
</div>
<!-- end #content -->
<div id="sidebar">
<ul>
  <li>
    <h2>General Topics</h2>
    <ul>
      <li><a href="index.html">Front Page</a></li>
      <li><a href="visualizations.html">Visualizations</a></li>
      <li><a href="preprocess.html">Pre-Processing</a><li>
      <li><a href="splitting.html">Data Splitting</a></li>
      <li><a href="varimp.html">Variable Importance</a></li>
      <li><a href="other.html">Model Performance</a></li>
      <li><a href="parallel.html">Parallel Processing</a></li>
    </ul>
    <h2>Model Training and Tuning</h2>
    <ul>
      <li><a href="training.html">Basic Syntax</a></li>
      <li><a href="modelList.html">Sortable Model List</a></li>
      <li><a href="bytag.html">Models By Tag</a></li>
      <li><a href="similarity.html">Models By Similarity</a></li>
      <li><a href="custom_models.html">Using Custom Models</a></li>
      <li><a href="sampling.html">Sampling for Class Imbalances</a></li> 
      <li><a href="random.html">Random Search</a></li> 
      <li><a href="adaptive.html">Adaptive Resampling</a></li> 
    </ul>
    <h2>Feature Selection</h2>
    <ul>
      <li><a href="featureselection.html">Overview</a>
      <li><a href="rfe.html">RFE</a></li>
      <li><a href="filters.html">Filters</a></li>
      <li><a href="GA.html">GA</a></li>
      <li><a href="SA.html">SA</a></li>
    </ul>  
  </li>
</ul>
</div>
<!-- end #sidebar -->
<div style="clear: both;">&nbsp;</div>
</div>
<div class="container"><img src="images/img03.png" width="1000" height="40" alt="" /></div>
<!-- end #page -->
</div>
<div id="footer-content"></div>
<div id="footer">
<!--begin.rcode echo = FALSE
knit_hooks$set(inline = hook_inline)    
end.rcode-->   
<p>Created on <!--rinline I(session) -->.</p>
</div>
<!-- end #footer -->
</body>
</html>



